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In Bayesian analysis of multi-way contingency tables, the selec- 
tion of a prior distribution for either the log-linear parameters or the 
cell probabilities parameters is a major challenge. In this paper, we 
define a flexible family of conjugate priors for the wide class of discrete 
hierarchical log-linear models, which includes the class of graphical 
models. These priors are defined as the Diaconis-Ylvisaker conjugate 
priors on the log-linear parameters subject to "baseline constraints" 
under multinomial sampling. We also derive the induced prior on the 
cell probabilities and show that the induced prior is a generalization 
of the hyper Dirichlet prior. We show that this prior has several de- 
sirable properties and illustrate its usefulness by identifying the most 
probable decomposable, graphical and hierarchical log-linear models 
for a six-way contingency table. 

1. Introduction. We consider data given under the form of a contin- 
gency table representing the classification of N individuals according to a 
finite set of criteria. We assume that the cell counts in the contingency table 
follow a multinomial distribution. We also assume that the cell probabil- 
ities are modeled according to a hierarchical log-linear model. The class 
of discrete graphical models that are Markov with respect to an arbitrary 
undirected graph G is an important subclass of the class of hierarchical log- 
linear models, in part because graphical models can be interpreted in terms 
of conditional independences that can easily be read off of the graph, and in 
part because they allow for parsimony in the number of parameters in the 
analysis of complex high-dimensional data. We will therefore give special 
attention to the class of graphical models throughout the paper. 



Received March 2008; revised November 2008. 

^Supported by NSERC Discovery Grant A8946. 

AMS 2000 subject classifications. 62F15, 62H17, 62E15. 

Key words and phrases. Hierarchical log-linear models, conjugate prior, contingency ta- 
bles, hyper Markov property, hyper Dirichlet, model selection. 

This is an electronic reprint of the original article published by the 
Institute of Mathematical Statistics in The Annals of Statistics, 
2009, Vol. 37, No. 6A, 3431-3467. This reprint differs from the original in 
pagination and typographic detail. 



1 



2 



H. MASSAM, J. LIU AND A. DOBRA 



In the Bayesian analysis of contingency tables, the selection of a prior 
distribution for either the log-linear parameters or the cell probabilities pa- 
rameter is a major challenge (see Clyde and George [3]). For decompos- 
able graphical models, Dawid and Lauritzen [8] have identified a standard 
conjugate prior which they called the hyper Dirichlet. The hyper Dirichlet 
presents the mathematical convenience of a conjugate prior; it has the fiexi- 
bility given by a number of hyperparameters (as many as there are free cell 
probabilities in the model) and, additionally, has the strong hyper Markov 
property. The latter is very desirable, since it allows for local updates within 
prime components, thus simplifying the computation of Bayes factors in a 
model selection process. For decomposable models, with the hyper Dirichlet 
as a prior, Bayes factors can be computed explicitly. The hyper Dirichlet has 
therefore been used in many studies (see, e.g., Madigan and Raftery [21] or 
Madigan and York [23]). However, it has the disadvantage of being defined 
only for the class of decomposable graphical models, which, as the num- 
ber of factors increases, becomes a smaller and smaller part of the class of 
graphical models and, even more so, of hierarchical models. 

Considerable efforts have been devoted to the study of alternative pri- 
ors valid for the larger class of hierarchical models. Knuiman and Speed 
[18], Dellaportas and Forster [9] and King and Brooks [17] propose various 
versions of a multivariate normal prior for the log-linear parameters. 

In this paper, we propose a new prior for the class of hierarchical log- 
linear models. This prior is the Diaconis-Ylvisaker conjugate prior for the 
log-linear parameters subject to baseline constraints. We show that it is a 
generalization of the hyper Dirichlet to nondecomposable graphical mod- 
els and, even more generally, to hierarchical log-linear models. We also show 
that, like the hyper Dirichlet, it has the advantage of being a conjugate prior 
while offering flexibility through its hyperparameters. We illustrate its appli- 
cability for the well-known Czech Autoworkers example previously analyzed 
by Edwards and Havranek [14], Madigan and Raftery [21] and Dellaportas 
and Forster [9]. We employed MC^ to explore the space of decomposable, 
graphical and hierarchical log-linear models for this six-dimensional binary 
table. Dobra and Massam [13] and Dobra et al. [11] demonstrate that our 
conjugate priors scale to higher-dimensional examples arising from social 
studies as well as gene expression and genomewide association studies. 

A secondary aim of this paper is to contribute to a discussion on a question 
asked by Gutierrez- Pena and Smith [15], itself motivated by a characteriza- 
tion given, for univariate natural exponential families (henceforth abbrevi- 
ated NEF) by Consonni and Veronese [5]. The latter proved that the prior 
induced onto the mean parameter of an NEF, from the Diaconis-Ylvisaker 
conjugate prior for the log-linear parameters, is standard conjugate if and 
only if the variance function of the NEF is quadratic in the mean. Leucari 
[20] showed that, in the case of a decomposable graphical model, the induced 
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prior on the mean parameter is standard conjugate even though the variance 
function is not quadratic, thus providing a negative answer to the question 
posed by Guitterez-Pena and Smith [15] as to whether the characterization 
of Consonni and Veronese [5] could be extended to multivariate NEF. Here, 
we show more precisely that the induced prior on the clique and separator 
marginal probabilities, which we will denote , is standard conjugate (it is 
the hyper Dirichlet as mentioned above) while the induced prior on the cell 
probabilities parametrization, denoted pxj, does not share the same prop- 
erty. This is achieved through the derivation of the prior induced on the cell 
probabilities from our prior on the log- linear parameters. 

The paper is organized as follows. In Section 2, we define the parameters 
we chose to use in order to express the multinomial distribution. They are 
the classical log-linear parameters defined by the "baseline" or "corner" con- 
straints, and we show that this is the parametrization obtained if we make a 
change of variable from the cell counts to the marginal cell counts. In Section 
3, we derive the Diaconis and Ylvisaker [10] (henceforth abbreviated DY) 
conjugate prior for this parametrization for hierarchical log-linear models, 
and we study its properties. We first characterize the set of hyperparameters 
for which our conjugate prior is proper. We then use this characterization 
to construct a set of hyperparameters that leads to a proper prior. We com- 
pute the moments of the prior cell probabilities that can be used to guide 
our choice of hyperparameters. Finally, we show that, like the hyper Dirich- 
let, our prior on the log-linear parameters has what we might call the strong 
hyper Markov property extended to nondecomposable models, so that in- 
ference in a Bayesian framework can be made prime component by prime 
component. In Section 4, we derive the induced prior for the cell probabili- 
ties for the decomposable graphical model, the nondecomposable graphical 
model and, more generally, the general log-linear hierarchical model. As men- 
tioned above, we discuss the conjecture of Gutierrez- Pena and Smith [15]. 
In Section 5, we present a comprehensive analysis of the Czech Autoworkers 
data that includes a sensitivity study about the influence of the conjugate 
prior specification on the highest posterior probability log-linear models. In 
Section 6 we briefly talk about additional developments using our conjugate 
prior. Major proofs of some of our results herein are given in the Appendix. 

2. The log-linear model. 

2.1. The parametrization. Let V be the set of criteria. Let X = (X^, I7 € 
V) such that takes its values (or levels) in the finite set of dimension 
|/^|. When a fixed number of individuals are classified according to the 
\V\ criteria, the data is collected in a contingency table with cells indexed 
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by combination of levels for the \V\ variables. We adopt the notation of 
Lauritzen [19] and denote a cell by 

The count in cell i is denoted n(i), and the probability of an individual 
falling in cell i is denoted p{i). For E CV, cells in the E'-marginal table are 
denoted is ^ = X ^^^I-y and the marginal counts are written 

(2.1) n(iE)= ri{iE,jv\E)- 

For N = J2i£i'n'{i), (n) = {n{i),i E T) follows a multinomial A4{N,p{i),i £ Z) 
distribution with probability density function 

Let i* be a fixed but arbitrary cell which, for convenience, we take to be the 
cell indexed for each factor by the "lowest level" itself indexed, for conve- 
nience again, by 0. Thus, i* is the cell 

r = (o,o,...,o). 

We now have to choose a parametrization for the log-linear model; that 
is, a parametrization for logp(i). As shown in Darroch and Speed [7] (see 
also Lauritzen [19], Appendix B.2), each possible metric in the space M"^ of 
real- valued functions defined on X corresponds to a different parametrization 
of the log-linear model. Moreover, as illustrated in Wermuth and Cox [29], a 
given parametrization can be best suited to a given type of problem. There 
is, therefore, no "best" parametrization in general. 

In this paper, we choose to work with the parametrization given by "base- 
line" or "corner" constraints; that is, the parametrization that follows if we 
choose the "substitution weight" metric for the space M'^, as given in Sec- 
tion 3.1 of Darroch and Speed [7]. This parametrization has the practical 
advantage of yielding the marginal counts as the canonical statistic in the 
exponential family form of (2.2), thus making the derivation of the general 
form of marginal and conditional distributions, as well as that of conjugate 
distributions, very easy to express. The log-linear parameters are 

(2.3) eij(zs)= ^(-l)l^\^llogp(i^,i>.), 

F<ZE 

which, by Moebius inversion, is equivalent to 

(2.4) p{iE,i*E<^) = eyiV ^ OF{iF)- 

FCE 



(2.2) 



P((n)) 



A CONJUGATE PRIOR FOR DISCRETE MODELS 



5 



We note that 90{i*) = logp{i*),i G T and we will therefore adopt the notation 

(2.5) e0{i*)=e0, =p0 = exp6'0. 

The parametrization (2.3) was first used by Mantel [24]. It is used in most 
standard statistical software such as GLIM or R (see Agresti [1], page 150). 
It has recently been used in Consonni and Leucari [4] , in the case of binary 
data. It seems, however, that it is less commonly used in the literature 
than the so-called u-parametrization (see Bishop, Fienberg, and Holland 
[2]) though the "interaction" terms ^^(is) in (2.3) are easy to interpret as 
ratios of log-odds ratios or as partial cross-product ratios. Indeed, one can 
easily verify (see Lauritzen [19], page 37) that for any a,/? in E, with the 
notation E~ = E\{a, f3}, 0£;(i_E) can also be written as the alternating sum 
of conditional log-odds ratios 

6e{ie)= 2^ (-l)i^^'^ilog- 



FCE- 



Another pleasant feature of this parametrization is that, as we shall see more 
precisely at the beginning of the next subsection, the parameters given in 
(2.3) are obtained as the canonical parameters of the multinomial distri- 
bution when we make the change of variable from the cell counts to the 
marginal cell counts. In order to identify which ones of the 9E{iE) defined 
in (2.3) is a free parameter, we need the following lemma. 

Lemma 2.1. If for ^ e E, E CV we have = i* = 0, then OEiiE) = 0. 

Proof. By definition, and since (^fu75 ^(j7U7)'=) ~ {^F-,i*Fc) if «7 = «^ = 0, 
we have 

eE{iE)= E (-l)l^\^llogp(iF,i^.) 
FC£;\7 

- Yl (-l)'^^^'logp(iFU7,^(FU7)0 
FQEYi 

= E (-l)l''\^llogp(iF,i>.) 

- E (-l)l''\^llogp(iF,i>c) 

FQEYi 

= 0. □ 

From this lemma, it follows immediately that our parametrization is in- 
deed the "baseline" or "corner" constraint parametrization that sets to the 
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values of the £^-interaction log-linear parameters when at least one index in 
E is at level (see Agresti [1], page 150). Therefore, for each E CV, there 
are only 1176-5(1-^71 ~ 1) parameters and for any E ^V, we introduce the 
convenient notation 

(2.6) rE = {iE\u,^i;,\f-feE}. 

In words, 1^ the set of marginal cells ie such that none of their com- 
ponents is equal to 0. We set ly =I \ {i*}- For example, if = {a,b,c}, 
a takes the values {0,1,2,3}, b takes the values {0,1,2}, c takes the values 
{0,1}, then 

I*E = {(1, 1, 1), (2, 1, 1), (3, 1, 1), (1,2, 1), (2, 2, 1), (3, 2, 1)}. 
It will also be convenient to introduce the notation 

(2.7) £e = {ECV,E^0} 

for the power set of V deprived of the empty set and the notation £ for the 
power set of V. 

By (2.4) and (2.5), we have 

P0 = 1 - p(^) 



(2.^ 



1- XI J2 PiiE,ih) = '^- J2 expJ2^Fi^F) 

1- E E e^p(^''+ E MiF)) 

E&SeiE&It FCE,F^lZ ' 



= i-P0 E E ^^p( E ^piip) 

In order to simplify our notation, from now on, we will use 

FQqE 

to express that F is included in E but is not equal to the empty set and, for 
€ Ie^ E £ £, the notation 

i{E) = {iE,i*Ec), 

which is not to be confused with i^;, the E marginal cell. We will also write 
9{iE) for OEiiE)- Then, (2.8) yields 

(2.9 p{i{E)) = —— — — Ee£. 

1 + EEeSe ^jEex*E "^"^P^^fcqE ^Uf)) 

We note, in particular, that 
(2.10) p0 



1 + Ebg^q SiEeX* exp(EFCei? ^Uf)) ' 
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2.2. The multinomial distribution for discrete data. We now want to give 
the probability density function of the multinomial distribution under the 
form of an exponential family when the statistical model is a hierarchical log- 
linear model. Let us first show that the parameters in (2.3) are the canonical 
parameters of the multinomial distribution for the saturated model after we 
make the change of variables 

(2.11) (n) = {n{i),i er)^Y = {y{iE) = n{iE),E ££e,iEe 1%) 
from joint cell counts to marginal cell counts as defined in (2.1). 

Lemma 2.2. The probability function of the multinomial distribution as 
given in (2.2) can be represented as a natural exponential family, with canon- 
ical parameters 6{iE), E € £Q,iE 1^ defined in (2.3) and with canonical 
statistics the marginal cell counts {n{iE),E £Q,iE (zT%), as follows: 

np(i)"«=exp| J2 E riiiEMiE) 

(2.12) 

- A^lo 



g(i+ E E E 



Proof. We have 

ViviS)'''^''=vf*^ n n f(^(^))"«^^) 

n{i(E)) 



nil*) 



n n (expE^(^^))' 



E&£eiE&T% ^ FCE 

E^-^E 



n n ^Mn{i{E)) Y (^{iF))p^ 

EG£eiEei*E ecqE 



^(n+T.E^e^T..^^^,n(i{E)) 



exp Y E E (^('F, 

Ee£QiEei*E^ fcqE 



:p0exp Y E ^{^E)0{iE) 

E€£e iE&1% 

>{ E E 0{iE)n{iE)+Ne^\, 

^E££QiE&*E 

where the second equality is due to (2.4), the third to (2.5), the fourth to 
the identification of the exponent of pgj as the total count A'^, the fifth to 
(2.1) and the sixth to (2.5) again. Finally, (2.12) follows from (2.10). □ 
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From the change of variable (2.11) and Lemma 2.2, it follows immediately 
that the family of distributions of Y is the natural exponential family 

explEfiege d{iE)y{iE)} 



(2.13) 



where is a reference measure of no particular interest to us here. This gives 
us the density for the saturated model. 

Let us now consider the hierarchical log-linear model generated by the 
class A = {Ai, . . . , Ak} of subsets of V, which, without loss of generality, we 
can assume to be maximal with respect to inclusion. We write 

(2.14) V = {EQq Ai for some i = l,...,k} 

for the indexing set of all possible interactions in the model, including the 
main effects. It follows from the theory of log-linear models (see also Darroch 
and Speed [7]) and from Lemma 2.1 that the model for the cell counts p{i) 
is the log-linear model with generating class A if and only if the following 
constraints are satisfied 

(2.15) 0{iE)=O, EiV. 
Therefore, in this case, for ie ^1%^ (^-4) becomes 

(2.16) log^= E Oi^F). 

FCE,F£V 

Let us now consider an undirected graph G with vertex set V. Darroch, 
Lauritzen and Speed [6] have shown that, for the subclass of graphical models 
Markov with respect to G, the generating class is equal to the set of cliques 
of G, that is, the set of maximal complete subsets of G. Therefore, for this 
subclass, 

(2.17) 2:' = {i:)CeyjD complete}. 

In general, for the class of hierarchical models with generating class A, the 
nonzero free log-linear parameters are 

(2.18) ev = {eiiD),DeV,iDelh}- 

Let us adopt the short notation 



FCj,D 
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to indicate that F Cq D and F ^T). Then, for the hierarchical log-hnear 
model, (2.9) and (2.10) become 

2.19) p(z(E =— — — 57-^' ^e^e> 

(2.20) P0 ^ 



Through an argument parallel to that given in Lemma 2.2, it follows that, 
in the case of a log-linear model with generating class A, the family JT^ in 
(2.13) becomes 

(2.21) J'^.„ = {My■,0v)My),OveR^^}, 

where, as in the saturated case, the measure fJ-viv) is of no particular interest 
to us here, 9x> = {0{iD),F) G 15, is ^ ^h) '^^ canonical parameter, the 
dimension dxi of the parameter space is equal to (see Darroch and Speed [7], 
Proposition 4.3) 

dv= E n(iAi-i) 



and 



(2.22) 



N\og(l+Y. exp ^(^f))| 



It is important to note here that, correspondingly to (2.18), only cell proba- 
bilities of the form p{i{D)),D G G 2"!), will be free probabilities, since, 
by Lemma 2.1, all others can be expressed in terms of 

(2.23) pv = ipiiiD)),DGV,iDeIh), 

which will be the cell probability parameter of the multinomial distribution 
of the hierarchical log-linear model. 

When the data is binary, that is, when \I^\ = 2,7 G 1/, there is only one 
element in for each E G £q; therefore, since 6{iE) is zero \i ie ^ 1%^ 
can use the simplified notation 9{E), p{E) and y{E) =n(E), respectively, 
for the canonical parameters 9{iE) in (2.3), the cell probabilities p{i{E)) in 
(2.9) and the marginal counts in (2.11) in all the formulas above. 

We note that (2.23) becomes pv = {p{D),D£ V). 

Let us illustrate this notation with an example. Let G be the graph 
with vertices a,b,c,d and edges (a, 6), (6, c), (c, d) and {d,a). In this case. 
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the graphical model is actuahy the same as the hierarchical model with 
generating class equal to the set of cliques A= {ab,bc,cd,da}, and we have 

V = {a, b, c, d, ab, be, cd, da}, 

£q = {a, b, c, d, ab, be, cd, da, ac, bd, abc, bed, cda, dab, abed}. 

The linear constraints on 6e,E are 

9{ac) = e{bd) = 6{abc) = e{bcd) = 9{cda) = e{dab) = 9{abcd) = 

and the constraints on the cell probabilities are as follows: 

p{a)p{c) p{b)p{d) 

p{ac) = , p{bd) = , 

P0 P0 

p(ab)p(bc) p{bc)p{cd) 

P[abc) = — , p{bcd) = — , 

p{b) p[c) 

p{cda) = P-^^^^, p(da6) - ^(^"^^^'^'^ 



p{abcd) 



p{d) p{a) 
p{ab)p{bc)p{cd)p{da)p0 



p{a)p{b)p{c)p{d) 
According to (2.20), we have 

_^ ^e{b)+e{c)+e{bc) _^ ^e{c)+e{d)+e[cd) j_ ^e{d)+e{a)+e{da) 

^e{a)+e{b)+e{c)+e{ab)+e{bc) _^ ^e{b)+e{c)+e{d)+e{bc)+e{cd) 
_l_ ^e{c)+e(d)+e(a)+e{cd)+e{da) _|_ ^e{d)+e{a)+e(b)+d(da)+e{ab) 
_l_ ^e(a)+e{b)+e(c)+e{d)+e{ab)+e(bc)+e(cd)+e{da) 

The other cell probabilities can be written in terms of 9 according to (2.19). 

3. The conjugate prior for the log-Hnear parameter 6. From (2.22), it 
is clear that, for the three nested classes of models considered in this paper 
(graphical with respect to G decomposable, graphical with respect to an 
arbitrary undirected graph G and hierarchical) the conjugate prior for 9x>, 
as given by Diaconis and Ylvisaker [10], is given immediately by its density 
with respect to the Lebesgue measure 

'i^v{9v\s,a) = Iv{s,a)~^ 
(3.1) xexpjE E G{iD)s{iD) 



alog(l+ J2 E E ^(^^) 
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where lD{s,a) is the corresponding normahsing constant and 

(3.2) {s,a) = {s{iD),DeV,iD(^Ih,Oi), sGM'^^, aGM, 

are the hyperparameters. 

In order to be able to use this prior in practice, we need to answer a 
number of questions. The first basic question is to know for which values 
of the hyperparameters {s,a) the distribution is proper; that is, when does 
Ix>{s,a) < +00 hold. Next, we can ask how to construct such hyperparame- 
ters. We address these two questions first and then we will give an example 
showing how to choose {s,a) to refiect prior knowledge. 

Lemma 3.1. The prior distribution (3.1) with hyperparameters 
as defined in (3.2), is proper if and only if ^ belongs to the V-marginal cell 
probability space of ; that is, if and only if a > and there exists an 
array of real numbers p{j) > 0, j € I such that 

(3.3) s{iD) = a E pO')' DeV,iD€lh, 

jD=iD 

where, for E ^ 8, 



(3.4) p{i{E)) 



for some 9x> € ^ 



Proof. Since the parameter space of (2.21) is Qt> = M^^, by Theorem 
1 of Diaconis and Ylvisaker [10], a necessary and sufficient condition for 
Ivis, a) to be finite is that a be a positive scalar and that — s = —{s{iD),D € 
Djio ^ Td) be in the interior of the convex hull of the support of pxi- Since 
the Laplace transform 

L^^{e-D)=(l+ E E exp( E ^(Jf)''"^ 

is defined on 0x), which is open, the interior of the convex hull of the sup- 
port of fix> is equal to the mean space Mp of Tf^j,. We therefore want to 
identify Mx>. Let k^^{9v) = logLf^^{9x>). S ince J^fj.'p is a natural exponential 
family with parameter 9d € we have 



(3.5) 



m{zn) = E{n{zn)) = N^^^ = N ^ p{j)\, 
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where p{j) are as in (2.19). Therefore, ^^,De V,iDeTl, must have the 
same properties as J2j£X,ji^=ii^pU)j ^^^^ the lemma follows. □ 

From Lemma 3.1, we know that, for vrx)(0x'|(s, a)) to be proper, we can 
choose (s, a) so that ^ = ^{s{iD),D € "D, iD £ ^h) ^ the P-marginal proba- 
bilities of a fictive probabihty table with cell probabilities of the form (2.19) 
and (2.20) or equivalently the cell probabilities of a hierarchical log- linear 
model with generating set V. 

A first way to build the cell probabilities of such a fictive contingency 
table, that is to obtain (s,a), is to follow the lemma above: 

1. Choose an arbitrary 6x> = {9{iD),D &'D,i£) & I}))- 

2. For each i = i{E),E G Sq^ie define ^ to be equal to the right- 
hand side of (2.19), and to the right-hand side of (2.20) for E = 0. This 
defines for all i S T. 

3. For DeV,ineIh,let '-^ = E,ei,jn=^n'^- 

4. Choose a > arbitrarily and derive s = {s{iD),D G I^, ^ ^d) from step 
3 above. 

We note here that these hyperparameters are consistent across models, in the 
sense that the fictive marginal counts for different models can be obtained 
from a single = {9(iD),D Cq V,i (zI). For each model, that is, each V, we 
then build the cell probabilities of a fictive table of counts through steps 1-3 
above and the cell counts through step 4. 

A second way to construct the cell probabilities of the fictive contingency 
table is to start with an arbitrary given contingency table with all cell counts 
(v) = {i^{i£)),D Cq V,iEl) positive, not necessarily integers. Let a denote 
the total count in that table. The maximum likelihood estimate pD of px) 
of the fictive table cell probabilities satisfying the constraints of the model 
and the likelihood equations 

u{iD) = a p{j), DeV^iD^Ih^ 

j£l,jD=iD 

exists; therefore, s = {u[i£,),D € 15, iz) G X^) and a satisfy the conditions of 
Lemma 3.1. The hyperparameters obtained by this second method are also 
consistent across models in the sense that they are obtained from one single 
arbitrary given table of counts. As will be illustrated in the Spina Bifida 
example of Section 3.1, the first method can be very convenient. 

The choice of a > in the methods given above is indeed arbitrary, but 
it is not innocent in the sense that, given a model determined by the 
choice of a can change the shape of the prior distribution vrx)(^i)|(s, a)) 
and thus affect the posterior density and further inference. The following 
example illustrates our point. We will also see the impact of the choice of a 
in Section 5. 



A CONJUGATE PRIOR FOR DISCRETE MODELS 13 

Example 3.1. Let us consider the graph G which has the vertex set V = 
{a, 6, c} and two chques {a, 6} and {6, c}. Let us also assume, for simphcity, 
that each variable Xa,Xft,Xc is binary. The cell probabihty parameter px>, 
as defined in (2.23), is 

Pv = {p{a),p{b),p{c),p{ab),p{bc)). 

We consider the graphical model Markov with respect to G. It is difficult to 
see the impact of the choice of a on vrx)(^x)|(s, a)), since we are not familiar 
with this distribution, but things are clearer when we look at the induced 
density on px>- As we shall see in Example 4.1, the density induced from 
T^viSvliSja)) on px> is equal to 

ttUpvUs, a)) = (lv{s, a) (l - ^^'^ ^ ^ 



P0 

V a-s(a)-s{b)-s{c)+s{ab)+s{bc)-l 
X P0 

In order to obtain the hyperparameters, following the second method given 
above, we can take a fictive probability table with all entries equal. If we 
choose a = l, then (s(a), s(c), s(a6), s(6c)) = |(4, 4, 4, 2, 2), while, if we 
choose a = 16, (s(a), s(c), s(a6), s(6c)) = 2(4, 4, 4, 2, 2). The correspond- 
ing conjugate priors for px> are, respectively, 

11111^ 



■^viP-D 



2' 2' 2' 4' 4 



,1 



pI 



ocp(a)-6/8p(6)-ip(c)-6/8^(a6)-7/«p(6c)-7/«('l - ^^^^^ 
7rP(pi,|((8,8,8,4,4),16)) 

oc p{afp{b)~^p{cfp{ab)p{bc) (l - E^^^ 



-1 



p(a)30/8^,(c)30/8p(a&)l5/8p(6c)l5/^ 



The ratio of the two densities is 
Jp((2(4,4,4,2,2),16)) 
/i,(l/4(2,2,2,l,l),l) 

This ratio varies as pD varies, and the two densities clearly have very different 
shapes and, therefore, give more prior weights to different pp- 

Let us note here that for the example above, the underlying graph is 
decomposable and, as we shall see further in Section 4.1, in that case, the 



14 



H. MASSAM, J. LIU AND A. DOBRA 



prior Tr^{px>\{s , a)) coincides with the Hyper Dirichlet defined by Dawid 
and Lauritzen [8]. Using the notation of Section 4.1 and Proposition 4.1 
and cahing Ci the clique {a,b}, C2 the clique {b,c} and S the separator 
{b}, the ratio above can be written as the ratio of two hyper Dirichlet with 
hyperparameters equal to, for a = 1, 

a^\D) = l, DCCi, 1 = 1,2, a^(6) = a^(0) = i, 
and, for a = 16, equal to 

a^'{D) = A, DCCi, 1 = 1,2, = a^(0) = 8. 

The ratio is therefore equal to 

r(16)r(8)2r(l/4)8 \p^^{ab)p^^ {a)p^^ {b)p^' (0^1 15/4 



r(i)r(i/2)2r(4)^ 



p^{b)p^{0) 



■pC2 (5c)p'^2 (5)pC2 (-c)pC2 (0) 



p^{b)p^{0) 



15/4 



Though expressed here in the more familiar marginal clique and separator 
marginal cell probabilities, this second expression of the ratio of prior den- 
sities may be more difficult to apprehend than the first one in terms of cell 
probabilities. 

We note also that for the saturated model, the prior with a = 1 and 
all Active cell counts equal, is the prior advocated by Perks [25] (see also 
Dellaportas and Forster [9]). 

3.1. Moments of the cell probabilities. We now compute the moments of 
the cell probabilities. As we show below through an example, these moments 
can, in some instances, be used to guide our choice of hyperparameters when 
we have prior information. 

Proposition 3.1. Consider the distribution 7rx)(0i?|(s,Q;)) as defined in 
(3.1). Let r be a positive integer. Then, for D ^'D,iD ^ the rth moment 
of the generalized odds ratio is 



^FCD 



Iv{sD,a) 



(3.6) 

Iv{s,a) 

where the components of so ore equal to those of s except for 



~SD{i{D))=s{i{D))+r. 
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Moreover, for all E ^ 8, the rth moment of the cell probabilities p{i{E)) is 
(3-7) ) = , . , 

1X>\S, CX) 

where the components of se.t equal to those of s, except for 
SE,MF)) = sii{F))+r, ifFC^E. 

The proof of this proposition is simple and is omitted. Equation (3.7) 
follows from the fact that 

(3.8) p^i^E)) =e'^'''^^^^^T,E'^'^^ = e^^^T,^'^'^^-'^^('^\ 

As we shall see in Section 4, the normalising constant Ix> can be computed 
explicitly when the model is Markov with respect to a decomposable graph 
G. Otherwise, the normalising constants have to be computed numerically 
by the usual approximation methods. 

We now show through an example how the results in Proposition 3.1 
above can be used to guide our choice of (s, a) in the prior distribution. We 
consider the data given by Hook, Albright and Cross [16] and used by King 
and Brooks [17] to illustrate the fact, as we do it here, that with their prior 
they can translate prior information into values for the hyperparameters. In 
this dataset, there are three variables a, b and c each taking the values 1 or 
representing the presence or absence of, respectively, birth certificates, death 
certificates and medical records for each individual. The individuals under 
study are children with spina bifida. The data consists of an incomplete con- 
tingency table for each one of six years. From Hook, Albright and Cross [16], 
it can reasonably be assumed that the model is the decomposable graphical 
model with cliques {a} and {b,c}. Consultation with experts suggests that 
the interaction between factors b and c is negative and the presence of this 
negative interaction is expected to create a relative decrease in the (be) cell 
probability by a proportion in the interval [0.1,0.9]. It was also expected 
that the total number of babies born with spina bifida during the study 
period would lie in the interval [9,56], and it was thought reasonable that a 
prior mean number of babies should lie in the interval [29,35]. 

Let us now express this prior information in terms of restrictions on (s, a). 
Since a can be thought of as the total count for a fictive prior contingency 
table, the belief about the total count could be immediately translated as, 
say, a = 30, which lies in the interval [29,35]. To reflect the negative in- 
teraction between factors b and c, we chose O^c negative, and to reflect the 
fact that the negative interaction given by 9i,c would cause the ratio of the 
expected value of pbc under V = {a,b,c} and the expected value of pbc un- 
der V = {a,b,c,bc} to be in the interval [0.1,0.9], we want the values of 
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9{a),9{b),e{c),9{bc) to satisfy 

/ N Eo(p(bc)) 
3.9 0.1 < ^°7;,\\ <0.9, 

where Eint{pE) denotes the expected value of pE under Tra^b,c,bc and Eq{pe) 
denotes the expected value of pE under nafi^c- Equation (3.7) gives the two 
expected values in (3.9) in terms of the normalising constant of (3.1). Since 
the chosen model, with main effects and be interaction, is a decompos- 
able graphical model, the normalising constant Ix>{s,a) can be obtained 
explicitly. Its formula is given further in Proposition 4.1. Using this for- 
mula, for both the model with and without the be interaction, that is with 
V = {a, 6, c, bc\ or V = {a, 6, c}, respectively, and also using Proposition 3.1, 
straightforward calculations yield 

^ / /, ^^ / s{a) \ s(bc) 
(3.10) ^ " ^ " 

Eoip{bc)) = (i 



a J a a 



From Lemma 3.1 and from the first method given in Section 3.2, we know 
that if we generate arbitrary values of 6t> = {9a,9b,6c,9bc) and compute the 
quantities 



for each E G {a, b, c, ab, ac, be, abe}, then the prior (3.1) with hyperparameter 
(s, a) defined by 

(3.12) &-=yp{F), D£V = {a,b,c,bc}, a > 0, 
a — ' 

FDD 

is proper. We can generate the 9x> in any way. Here, we choose to generate 9bc 
from a normal with mean —1.12 and variance | and to generate 9a, 9b and 9c 
from independent normals with mean and variance 1. We constrain those 
values to satisfy (3.9) expressed in terms of 9a,9b,9c,9bc using (3.10), (3.11) 
and (3.12). We choose, arbitrarily, the following set of values satisfying the 
required constraints: 

e{a) = -0.1200, 9{b) = 1.1100, 

0(c) = -0.0100, 9{be) = -1.8800. 

This yields values of s as follows: 

s(a) = 9.2324, s(6) = 16.4593, s(c) = 16.4476, s(6c) = 6.9874, 
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which, together with a = 30, defines a proper prior (3.1) reflecting our prior 
beUefs (3.9) and a G [29,35]. 

Clearly, if the prior information suggests that the model is not decompos- 
able, the moments can no longer be obtained explicitly. For given values of 
(s,a), lTi{s,a) has to be computed numerically and the formulae in Propo- 
sition 3.1 could only be used to verify that a particular choice of {s,a) is in 
accord with our prior belief. However, a first choice of (s, a) could be made 
from an approximating decomposable model. 

3.2. The strong hyper Markov property for graphical models. Let us now 
assume that the multinomial distribution of the contingency cell counts is 
Markov with respect to an arbitrary undirected graph G. In this subsec- 
tion, we will show that the generalized hyper Dirichlet in (3.1) is strong 
hyper Markov in the following sense. Let Pi,...,Pk a perfect sequence of 
the prime components of G, and let S2,---,Sk be the corresponding sep- 
arators. Though the prime components do not have to be complete, the 
separators Si = (Uj=i Pj) r)PiJ = '2,---,k are complete by definition. We 
will use the notation 

Ri = Pi\(^\JPj^=Pi\Si, 1 = 2,. ..,k, 

for the residuals, the notation 

V^', l = l,...,k, V^',V^\ l = 2,...,k, 

for the collection of complete subsets of the induced graphs Gpi,Gsi,Gfii, 
respectively, and the notation 

^(p^O, i = i,...,k, 

(3.13) 

e{is„V^'), is.els,, l = 2,...,k, 

for, respectively, the log-linear parameters of the P^-marginal multinomial 
and of the -R;-conditional multinomial given the value isi of the S'^-marginal 
cell. More precisely, these parameters are 

e{is,,v'^') = (0^'i*«'(ifl),Dc^fl, Ri,iD€ih), 

where 

0^'{io) = logl[{p^^i^P,^*P^^^))(-'y^''\ 

FCD 

e^'\^^^ (in) = log n ip'''^'''iiFXR,\p))^-'^'"'"' 

FCD 
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and p^' denotes P/-marginal probabilities andp^''*^' denotes i?;-conditional 
probabilities given the values is^ of the S^-marginal cell. 

We will say that (3.1) is strong hyper Markov with respect to G if, under 
(3.1), the variables 

9{VP^), e{is,,V^'), is, els„ 1 = 2,. ..,k 

are mutually independent. This is clearly a generalization of the strong hyper 
Markov property as given by Dawid and Lauritzen [8] . We have the following 
result. 



Theorem 3.1. If Od follows the generalized hyper Dirichlet as defined 
in (3.1), then the joint distribution of the parameters in (3.13) has density 

nt2V.(^^S«) ntiexp{(g(P^'),s(P^O)-afc(g(P^O)} 

nf=i/^p.(s^',«) Y\t2^Mm'^),s{vs^))-amvs.))] 

/-pPi(s'Pi,a) 

(3.14) xn n 



k ^ 



I^R,{sRi,s{isi)) 



1=2 is els I 

xexp{{9{is„V'^^),.s{is„V'^^)) 

-siisMf^iis^V""^))}, 
where s^ = {s{i£)),D ^'D^,i£) £T^) for A = Pi,Ri or Si, and 

{9{is„v'''),s{is„v'"))= Yl E 0'''^'^i{in)s{is,,iD), 



k{e{is,,V'")) = log(l+ E exp E ^""'''"'(^f) 



The parameters in (3.13) are therefore independently distributed; that is, 
(3.1) is strong hyper Markov. 

The proof is long and tedious but without conceptual difficulties and is 
ommitted here. It is important to note that s(i£)), Z) € 2?, i_D € T^j is always 
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the marginal count in the Active contingency table attached to the prior, 
whether it occurs in the marginal distribution of Pi, the marginal distribu- 
tion of Si or the conditional distribution of Ri given 1$^ ■ 

4. The induced prior on the cell probabilities. In this section, we give 
the expression of the induced conjugate prior in terms of the cell probability 
parameter, first for graphical models Markov with respect to a decomposable 
G showing that we obtain the hyper Dirichlet, then for general hierarchical 
models, which includes, in particular, models Markov with respect to an 
arbitrary graph G. The proofs of all our results are given in the Appendix. 

4.1. Decomposable graphical models. We first consider the case of the 
multinomial Markov, with respect to the decomposable graph G with set 
of cliques C = {Ci, I = 1, . . . ,k} and set of minimal separators S = {Si, I = 
2,...,k}, so that T> is the set of all possible subsets of C. Since in this 
subsection we deal with joint cell probabilities as well as Q-marginal or Si- 
marginal probabilities, in the expression z|)c),p'^' (^d, i|)c),p'^' (i_D, ijic), 
it will be understood that we have D as a subset of, respectively, V, Ci and Si, 
and as the complement of D in V, Ci and Si. We also, temporarily, do not 
use the i{D) notation but rather the more explicit, albeit more cumbersome, 
{iD,ih-) for i{D). 

Dawid and Lauritzen [8] defined the standard conjugate prior in terms of 
the clique and separator cell probabilities 

P^'iiD,ih^), DeV^', l = l,...,k, 

(4.1) 

p'^'{iD,ih), D£V^^, 1 = 2,. ..,k, iD^Tl, 
and called it the hyper Dirichlet. Its density is equal to 

nil Dire, {P%' , P^' {iD ,ih&, ag' , a^' {io ,ihc),D eV^' ,iD elh) 



(4.2) 
with 



Dire, iP0' , P^' {iD,ih^); a^' (in ,ihc),D gV^' ,iD elh) 



ip%r 



with a similar expression for Dir^, and where the hyper parameters 

{a%\a^'{iD,i*D-)^D ^V^\iD ^Td) and 

(4.3) 
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are hyperconsistent in the sense that, if Si = CiCi Cj, the marginal distribu- 
tions on Si obtained from either of the chque marginal distributions on Cj 
or Cj are the same. 

In this subsection, we derive the prior induced from 7rx)(0©|s,a) in (3.1) 
by the change of variable from 9x>, as defined in (2.18), to p^, as defined 
below in (4.4). We choose to work with which is the cell parametrization 
expressed in terms of marginal clique probabilities rather than with px> as in 
(2.23) because we want to compare the induced prior on p^ with the hyper 
Dirichlet. The probabilities in (4.1) are not all free variables. One way to 
choose the free marginal probabilities is as follows: 

/ k 
p^=ipC'{iD,ih^),DGV('^\[jv'^^J = l,...,k, 

\ j=2 

(4.4) 

p^^iiD,ih^),DeV^',l = 2,...,k,iDelh 

The Jacobian of the change of variable 9x> '—^ p^ is given in the following 
lemma. 

Lemma 4.1. The Jacobian of the change of variables from 9x> = {6{iD), 
D eV, if) GX|)) as given in (2.3) to p^ as given in (4-4) 

nf=2 P'l' n DGVSl GX*^ P^' {^D , «flc ) ' 

This lemma has already been stated in Leucari [20] in a slightly different 
form, and we give it here without proof. The following proposition says that 
the induced prior on p*^ is the hyper Dirichlet, which was also given by 
Leucari [20]. Here, we additionally give the correspondence between (s,a) 
and (4.3). 

Proposition 4.1. When the graph G is decomposable with set of cliques 
{Ci, i = 1, . . . ,k) and sets of minimal separators {Si,i = 2, . . . ,k) , the conju- 
gate prior induced from (3.1) is identical to the hyper Dirichlet (4-2) with 
hyper parameters (4-3), where 

CiDFDD jF&1*p 

ijF)D=iD 

«+ E(-i)""E«fe)' 



(4.5) 



dO 



dp' 



P 



a^'^{iD,i*Dc) 



(4.6) 



Ci 

"0 
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a 




{jF)D=iD 



(4.7) 




«+ E (-1)"" E ^(^^)- 



Moreover, 



(4.8) 



/D(s,a) 



r(«) ni=2 r(ai' ) DDei^s. n^^ex' r(a^' (i^ , ij,.)) ' 



4.2. Connection with previous work. From Proposition 4.1 above, from 
the form (4.2) of the hyper Dirichlet and the form of the multinomial Markov 
with respect to G decomposable, we see that the prior induced on p*^ from 
the DY conjugate prior (3.1) on 9x) has the same form as the likelihood 
function in terms of p'^ . We say that this induced prior on p'^ is standard 
conjugate, following the definition of Consonni and Veronese [5] who showed 
that, for the one-dimensional NEFs, the prior induced from the DY conju- 
gate prior on the canonical parametrization 6 onto the mean parametriza- 
tion fj, is standard conjugate if and only if the NEF has a quadratic variance 
function. Gutierrez-Pena and Smith [15] studied the case of a multivariate 
NEF. They first defined two parametrizations (p and A to be conjugate if 
the standard conjugate family of priors on A was identical to that induced 
from the standard conjugate family on (p by the change of variable from (j) 
to A. They denoted this property <j) ^ X. They then showed, in their The- 
orem 1, that </) ^ A if and only if the Jacobian |^| is proportional to the 
likelihood for A. From Lemma 4.1 and their Theorem 1, we can then imme- 
diately obtain that the induced prior on p*^ is the hyper Dirichlet [though we 
cannot obtain (4.6) and (4.7)]. Gutierrez-Pena and Smith [15] showed that 
the characterization of Consonni and Veronese [5] could not be extended 
to multivariate NEF's and conjectured that, with an extended definition of 
conjugacy, quadratic variance functions could characterize multivariate NEF 
with 9 and n conjugate. 

The result in Proposition 4.1 provides a counterexample to this conjec- 
ture, since it is easy to see that the variance function of the NEF Markov 
with respect to G decomposable is not a quadratic function of fiv- Leucari 
[20] had already observed this and, also, that 6xi ^ fJ-v- We have proved 
here, in addition, that 6x> ^ fJ-v p^ ■ The parametrization p^ is of course 
different from p^ as defined in (2.23). This distinction is important, since 
px> is not a linear function of and, as we shall see in Example 4.1, the 
parametrizations Qx> is not conjugate to the parametrization p-o^ even in the 
case of a decomposable model. 
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4.3. Arbitrary graphical and hierarchical models. To obtain the conju- 
gate prior in terms of pv, we need to compute the Jacobian \^^\- Before 
doing so, we need to define the fohowing quantities. For CGV^H^Sjic^ 
IcdH^^h' let 

. 0, otherwise, 

be the entries of a Odgxi I-^dI ^ n/fg^ \^h\ matrix F, where the rows are 
indexed by ic ^Ic-iC and the columns by jh S I^-, H G S. We note 
that the definition of F imphes that 

(4.10) nic,jH)= E nic,jH)= e (-i)""-'- 



(4.9) 



F{icdH) 



(iH)c=ic 



In the case of binary data for T> and £ as given in Section 2.4, the matrix 
F is 



(4.11) 



/o 


1 











1 








1 








1 
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1 

















1 








1 


1 

















1 








1 


1 

















-1 





























-1 





























-1 





[o 























-1 




1 







1 





1 


1 


1 \ 







1 




1 


1 





1 


1 




1 







1 


1 


1 





1 







1 







1 


1 


1 


1 












-1 
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-1/ 



We also need the following two lemmas. Their proof is given in the Appendix. 



Lemma 4.2. Let G be a nondecomposable prime graph, and let D be as 
in (2.17). For the matrix F as described in (4-9), the sum of the entries in 
each column jn , j € '^h ,Hq£q is such that 

(4.12) E n^c,JH)= E (-l)l'^l~' = l, 

icei*.,cev cc^H 

if and only if the subgraph induced by H is decomposable and connected. 
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We are now in a position to give the expression of the Jacobian for general 
graphical and hierarchical models. Let 

Uq = {F ^£q\F is either nondecomposable or nonconnected}. 

Let U = UQyj{0} and 

(4.13) a{H)=( Y: i-l f^-'-l), HG£. 

Lemma 4.3. The Jacobian J{pv) = l^f^l of the transformation px> ^ 0v 
is equal to 

(4.14) j(pp)= n pi^mU- E E (-i)""'"' 

DeV ^ H&Sq FQt>H 

for general hierarchical models. 

In the particular case of graphical models, (4-14) becomes 

(4.15) j{pv)= n piKD))(p^- E mH))am 

Example 4.1. For the same model as in Example 3.1, the Jacobian 
(4.15) for the graphical model Markov with respect to G and binary data, 
is 

P{a)p{c) 
P0 

We note, in reference to our discussion in Section 4.2 that J does not have 
the same form as the likelihood which is proportional to 

p{aY''p{bf^p{cf^p{abf-^p{bcY^-pl^ , 

where Xa,Xb,Xc,Xab,Xbc and x^ are appropriate integers. Therefore, 6x> and 
Pt> are not conjugate parametrizations in the sense of Gutierez-Pena and 
Smith [15] as defined in Section 4.2, even in the decomposable case. 

We can now give the main result of this section; that is, the conjugate 
prior for pD induced from (3.1). 

Theorem 4.1. The conjugate prior distribution induced from (3.1) by 
the change of variable O-d ^ Pv is 

(4.17) ^UPv\{s,a)) = ^^^^l[ U P(^(D)r^'^''^^~'p%'~\ 



(4.16) J = p{a)p{b)p{c)p{ab)p{bc) ( p0 
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where 

(4.18) aii{D))= J2 E i-lf'^''s{zp), 

FDD,F£V JfGX* 
0>)d=«d 

(4.19) a^ = a-J2 E 

and where K{pxi) = tt ^(pt^) — anc? J{pv) is as in (4-14) for gen- 

eral hierarchical models and (4-15) for graphical models. 



This result follows immediately from the expression of the conjugate prior 
(3.1) in terms of Oj), (2.3) and Lemma 4.3. 

Example 4.2. When the graph is the four cycle with binary data as 
considered in Section 2.2, U = {ac,bd,abcd,0}. From (4.11), (4.13) and the 
constraints 9{E) = for E ^D, it follows that a{ac) = a{hd) = l,a{abcd) = 
— 1. Moreover, 

p{ac) _ p{a)p{c) p{bd) _ p{b)p{d) 
V<z pI ' P0 pI ' 

p{abcd) p{ab)p{bc)p{cd)p{da) 
P0 P{a)p{b)p{c)p{d) 

For 

00 = a — s{a) — s{b) — s{c) — s{d) + s{ab) + s{bc) + s{cd) + s{da), 
we have 

7r(p2,|(s,a)) =/G(s,a)-^p(a)^('^)-^('^'^)-^('^^)-V(6)^('')-^('^*)-'^(*'=)-i 

X p(^Qyi(')~^it'c)-s{cd)-lp^^y{d)-s{cd)~s{da)~lp^^f^y{ab)-l 

■xp{bc)<^^^~^p{cd)<^'^^-^p{day'^'^'''^-^ 



PaPc _ PbPd ^ p{ab)p[bc)p{cd)p{da ) \ ^ 
pI pI P{a)p{b)p{c)p{d) 



5. Example. Czech Autoworkers data. We illustrate the use of our new 

priors in model selection for a classical data set previously analyzed many 
times in the literature. We first describe our model selection procedure, and 
then we describe the data and the results of our model search. The C++ 
code for the implementation of our methods can be obtained upon request 
from the authors. 
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5.1. Bayesian model selection. The Bayesian paradigm to model de- 
termination involves choosing models with high posterior probability se- 
lected from a set M. of competing models. We associate with each can- 
didate model m € a neighborhood nbd(m) C M. Any two models in 
m,m' £ A4 are connected through a path m = mi, 1712, ■ ■ ■ , ruk = m' such that 
rrij G nbd(mj_i) for j = 2, . . . ,k. The MC^ algorithm proposed by Madigan 
and York [23] constructs an irreducible Markov chain mt, t = 1,2, . . . with 
state space M and equilibrium distribution {p{m\{n)) :m G M.}, where (n) 
is the data in the form of a multi-way contingency table, and p{m\{n)) is 
the posterior probability of ni. We assume that all the models are a priori 
equally likely; hence, p(m|(n)) is proportional with the marginal likelihood 
p((n)|m). 

If the chain is in state mt at time t, we draw a candidate model m' from a 
uniform distribution on nbd(mt). The chain moves in state m' at time t+1; 
that is, mt+i = m' with probability 



where # nbd(?n) denotes the number of neighbors of m. Otherwise, the chain 
does not move; that is, we set m^+i = mt- 

The marginal likelihood of a model m is given by the ratio of normalizing 
constants 



where Dm are the possible interactions in m as in (2.14). 

The evaluation of the marginal likelihoods and the specification of model 
neighborhoods is done with respect to the particular properties of the set of 
candidate models considered: 

1. Hierarchical log-linear models. We calculate the marginal likelihood in 
(5.2) through the Laplace approximation (see, e.g., Tierney and Kadane 
[28]) to the normalizing constants for the prior and posterior distribution 
of log- linear model parameters. The neighborhood of a hierarchical model 
m consists of the hierarchical models obtained from m by adding one of its 
dual generators (i.e., minimal terms not present in the model) or deleting 
one of its generators (i.e., maximal terms present in the model). For details, 
see Edwards and Havranek [14] and Dellaportas and Forster [9]. 

2. Graphical log-linear models. We evaluate the marginal likelihood in two 
different ways: (i) we use the Laplace approximation to the normalizing con- 
stants Ivmiy + s,N + a) and Ix>^{s,a) as we did in the hierarchical case; 
(ii) we decompose the independence graph Gm of m in its sequence of prime 
components and separators and compute /xi„ as in (3.14). Dobra and Fien- 
berg [12] describe efficient algorithms for generating such a decomposition. 



(5.1) 




(5.2) 



p((n)[m) =/x,,„(y + s,iV + a)//i,„(s,a) 
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The normalizing constants for the complete prime components and the sep- 
arators (which are required to be complete) can be obtained explicitly (see 
Proposition 4.1). The normalizing constants for the incomplete prime com- 
ponents are estimated with the Laplace approximation. The neighborhood 
of a graphical model is defined by the graphs obtained by adding or remov- 
ing one edge from Gm- Since each graph has the same number of neighbors, 
the acceptance probability (5.1) reduces to min{l, ^^^^yy^^y-}- 

3. Decomposable log-linear models. In this case, the marginal likelihood 
can be explicitly calculated as in Proposition 4.1. The neighborhood of a 
decomposable model m is given by those models whose independence graphs 
are decomposable and are obtained by adding or deleting one edge from 
Gm- Tarantola [27] provides algorithms for determining which edges can be 
changed in a given decomposable graph such that the resulting graph is still 
decomposable. The size of the neighborhoods of two decomposable graphs 
that differ by one edge is not necessarily the same; thus, the acceptance 
probability (5.1) does not simplify as it did for graphical log-linear models. 

5.2. Results. We study the 2^ Czech Autoworkers table from Edwards 
and Havranek [14]. This cross-classfication of 1841 men gives six potential 
risk factors for coronary trombosis: (a) smoking, {h) strenuous mental work, 
(c) strenuous physical work, (d) systolic blood pressure, (e) ratio of beta 
and alpha lipoproteins and (/) family anamnesis of coronary heart disease. 

In the absence of any prior information, we specify a proper conjugate 
prior for log-linear parameters through a fictive 2^ table with all entries 
equal to a/64 for some a > 0. All of the log- linear models are therefore con- 
strained to have the same effective sample size 1841 -|- a. We remark that this 
approach to constructing a conjugate prior is equivalent to eliciting hyper- 
Dirichlet priors (see Madigan and York [22]). While the hyper-Dirichlet pri- 
ors are restricted to decomposable log-linear models, the properties of our 
conjugate priors extend naturally to graphical and hierarchical log- linear 
models. 

For each a € {0.01, 0.1,1,2,3, 32, 64, 128} we perform separate searches as 
follows: (i) a search over decomposable graphical models, (ii) a search over 
graphical models with marginal likelihoods estimated through decomposing 
the independence graph in its prime subgraphs, (iii) a search over graphical 
models with marginal likelihoods estimated through a single Laplace approx- 
imation and (iv) a search over hierarchical log-linear models. The results are 
shown in Tables 1, 2, 3 and 4. The four searches are labeled, respectively, 
"Dec," "Graph./PM," "Graph./Lapl" and "Hierar." For each search type 
and each value of a, we run four separate Markov chains from a random 
starting model for 25,000 iterations with a burn-in of 5000 iterations. We 
give the models whose normalized posterior probabilities are greater than 
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Table 1 

The most probable log-linear models for a G {0.01,0.1} 



Search 


a = 0.01 




a = 0.1 




Dec. 


ac\bc\d\be\f 


0.278 


ac\bc\d\be\ f 


0.172 




ac\bc\d\e\f 


0.236 


ac\bc\be\de\ f 


0.156 




cic\bc\d\cie\f 


0.212 


(ic\bc\d\cie\ f 


0.131 




ac\bc\d\ce\ f 


0.147 


ac\bc\ae\de\ f 


0.119 




CLc\bc\d\e\f 


med 


ac\bc\d\e\f 


med 


Graph. /PM 


ac\bc\d\ae\be\f 


0.856 


ac\bc\d\ae\be\f 


0.380 




ac\bc\ae\be\de\f 


0.078 


ac\bc\ae\be\de\f 


0.344 








tic tvc ttct tit; ue J 






ac\bc\d\ae\be\f 


med 


ac\bc\d\ae\be\f 


med 


Graph. /Lapl 


ac\bc\ae\be\de\f 


0.393 


ac\bc\ae\be\de\f 


0.450 




ac\bc\d\ae\be\f 


0.336 


ac\bc\ad\ae\be\f 


0.184 




ac\bc\ad\ae\be\f 


0.160 


ac\bc\d\ae\be\f 


0.122 








ac\bc\be\ade\f 


0.067 




ac\bc\d\ae\be\f 


med 


ac\bc\ae\be\de\f 


med 


Hierar. 


ac\bc\ad\ae\ce\de\f 


0.251 


ac\bc\ad\ae\ce\de\f 


0.362 




ac\bc\ad\ae\be\de\f 


0.157 


ac\bc\ad\ae\be\de\f 


0.227 




ac\bc\ae\ce\de\f 


0.136 


ac\bc\ae\ce\de\f 


0.062 




ac\bc\d\ae\ce\f 


0.116 








ac\bc\ae\be\de\f 


0.085 








ac\bc\d\ae\be\f 


0.0725 








ac\bc\ad\ae\ce\f 


0.055 








ac\bc\ad\ae\ce\de\f 


med 


ac\bc\ad\ae\ce\de\f 


med 



0.05 as well as the median log-linear models that are labeled with "med." A 
median model contains those interaction terms having a posterior inclusion 
probability greater than 0.5. 

We compare our highest posterior probability models with the log-linear 
models identified by Dellaportas and Forster [9], who proposed a reversible 
jump Markov chain Monte Carlo with normal priors for log-linear parame- 
ters, and with the decomposable models selected by Madigan and Raftery 
[21], who employed a hyper-Dirichlet prior for cell probabilities. Our most 
probable decomposable model bc\ace\ade\f for a = 1,2 or 3 is the same de- 
composable model as the one identified by both Dellaportas and Forster 
[9] and Madigan and Raftery [21]. Our most probable graphical model 
ac\bc\be\ade\f for q = 1,2 or 3 in the "Graph. /Lapl" search is precisely the 
most probable model of Dellaportas and Forster [9] and is the second best 
model selected by Edwards and Havranek [14]. Similarly, our most prob- 
able hierarchical model ac|6c|ad|ae|ce|(ie|/ for a = 1,2 or 3 coincides with 
the model with the largest posterior probability identified by [9]. The same 
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Table 2 

The most probable log-linear models for a G {1,2} 



Search 


a = 1 




a = 2 




Dec. 


bc\ace\ade\f 


0.250 


bc\ace\ade\f 


0.261 




bc\ace\de\f 


0.104 


bc\ace\de\f 


0.177 




bc\ad\ace\f 


0.102 


bc\ace\de\bf 


0.096 




ac\bc\be\de\j 


0.060 


bc\ad\ace\j 


0.072 




bc\cLce\de\bf 


0.051 


bc\ace\de\b f 


0.065 




bc\ace\de\f 


med 


bc\ad\ace\de\ f 


med 


Graph. /PM 


ac\bc\ae\be\de\f 


0.446 


ac\bc\ae\be\de\f 


0.371 




ac\bc\ad\ae\be\f 


0.182 


ac\bc\ad\ae\be\f 


0.151 








fi ^\nf^ \ t:>\ric \ /I £>\rt T 
t(-0| ct-i ttc |Ufc. 1 ttt;| C J 


U. -LOU 




ac\bc\d\ae\be\f 


0.054 


ac\bc\ae\be\de\ef 


0.057 








ac\bc\ad\ae\be\bf 


0.055 




ac\bc\ae\be\de\f 


med 


ac\bc\ae\be\de\f 


med 


Graph. /Lapl 


ac\bc\be\ade\f 


0.301 


ac\bc\be\ade\f 


0.341 




ac\bc\ae\be\de\f 


0.203 


ac\bc\be\ade\bf 


0.141 




ac\bc\be\ade\bf 


0.087 


ac\bc\ae\be\de\f 


0.116 




ac\bc\ad\ae\be\f 


0.083 


ac\bc\be\ade\ef 


0.059 




ac\bc\ae\be\de\bf 


0.059 








ac\bc\ad\ae\be\de\f 


med 


ac\bc\be\ade\f 


med 


Hierar. 


ac\bc\ad\ae\ce\de\f 


0.241 


ac\bc\ad\ae\ce\de\ f 


0.175 




ac\bc\ad\ae\be\de\f 


0.151 


ac\bc\ad\ae\be\de\f 


0.110 




ac\bc\ad\ae\be\ce\de\f 


0.076 


ac\bc\ad\ae\be\ce\de\f 


0.078 




ac\bc\ad\ae\ce\de\bf 


0.070 


ac\bc\ad\ae\ce\de\bf 


0.072 




ac\bc\ad\ae\ce\de\f 


med 


ac\bc\ad\ae\be\ce\de\f 


med 



consistency of the results obtained holds for most of the highest probable 
models selected by us and by Dellaportas and Forster [9]. 

Tables 1, 2, 3 and 4 show the sensitivity of the highest posterior prob- 
ability models with respect to the choice of priors and to the class of log- 
linear models considered. For a fixed a, the highest probable models become 
sparser as we sequentially relax the structural constraints from decompos- 
able to graphical and hierarhical. We remark that the most probable graph- 
ical (hierarchical) models can be obtained from the most probable decom- 
posable (graphical) models by dropping some of the second-order interaction 
terms. Increasing a from 1 to 128 (i.e., increasing each fictive cell count from 
1/64 to 2) leads to the inclusion of additional terms in the highest probable 
models. We also remark that the two estimation methods for the marginal 
likelihoods of graphical models yield consistent results. For a = 0.01 and 
a = 0.1, we note that our results, though not abherent, are not as entirely 
consistent with the results obtained for a £ {1,2,3,64, 128} as these results 
are between themselves or with results obtained in previous studies. This 
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Table 3 

The most probable log-linear models for a £ {3, 32} 



Search 


a = 3 




a = 32 




Dec. 


bc\ace\ade\f 


0.312 


bc\ace\ade\bf 


0.136 




bc\ace\ade\bf 


0.155 


ace\bce\ade\bf 


0.098 




bc\ace\de\f 


0.107 


bc\ace\ade\f 


0.062 




bc\ace\ade\e f 


0.065 


abc\ace\ade\bj 


0.060 




bc\ace\de\bf 


0.053 


bc\ace\cLde\e f 


0.057 




bc\ace\ade\f 


med 


bc\ace\ade\f 


med 


Graph. /PM 


ac\bc\ae\be\de\f 


0.316 


bc\ace\ade\bf 


0.068 




ac\bc\ae\be\de\b f 


0.157 


ac\bc\ade\bde\bf 


0.050 






n 1 98 








ac\bc\ae\be\de\ef 


0.066 








ac\bc\ad\ae\be\bf 


0.064 








ac\bc\ae\be\de\f 


med 


bc\be\ace\ade\f 


med 


Graph. /Lapl 


ac\bc\be\ade\f 


0.339 


ac\bc\be\ade\bf 


0.188 




ac\bc\be\ade\bf 


0.172 


ac\bc\be\ade\f 


0.103 




ac\bc\ae\be\de\f 


0.077 


ac\bc\be\ade\ef 


0.079 




ac\bc\be\ade\ef 


0.072 


ac\bc\be\ade\af\bf 


0.057 








ac\bc\be\ade\bf\df 


0.052 




ac\bc\be\ade\f 


med 


ac\bc\be\ade\bf 


med 


Hierar. 


ac\bc\ad\ae\ce\de\f 


0.137 


ac\bc\ad\ae\be\ce\de\b f 


0.020 




ac\bc\ad\ae\be\de\f 


0.086 








ac\bc\ad\ae\be\ce\de\f 


0.075 








ac\bc\ad\ae\ce\de\bf 


0.069 








ac\bc\ad\ae\be\ce\de\f 


med 


ac\bc\ad\ae\be\ce\de\b f 


med 



is not surprising, since for values of a very close to 0, we encounter two 
potential problems: first, the unknown behaviour of the Bayes factor as a 
tends to and, second, the evaluation of the prior normalizing constant. 
The first problem is a very important general problem regarding the be- 
haviour of Bayes factors when the total "fictive cell counts" tends to (see 
Steck and Jaakkola [26] for related results on directed acyclic graphs). This 
problem needs careful study and will be the subject of further work for our 
particular prior distributions. In this paper, we confine ourselves to observ- 
ing a difference in the behaviour of the model selection results between the 
case a € {0.01,0.1} and the results obtained for other larger values of a. 
Regarding the second problem, we have observed numerically that, for a 
close to 0, the prior distribution for each 9d is very fiat; hence, the Laplace 
approximation is bound to yield poor results. 

A very interesting question is whether there exists evidence of a link be- 
tween / and the other five risk factors. Whittaker [30], page 263, chooses the 
graphical model abce\ade\bf that links / with b, strenuous mental work. The 
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Table 4 

The most probable log-linear models for a £ {64, 128} 



Search 


a = 64 




a = 128 




Dec. 


ace\bce\ade\bcf 


0.134 


ace\bce\ade\bcf 


0.359 




ace\bce\ade\bf 


0.118 


ace\ade\bcf\cef 


0.133 




ace\ade\bcf 


0.081 


abc\ace\ade\bcf 


0.105 




bc\ace\ade\bf 


0.071 


abc\abe\ade\bcf 


0.104 




abc\ace\ade\acf 


0.062 


ace\ade\acf 


0.089 




abc\ace\ade\bf 


0.055 


abce\ade\acf 


0.060 




abc\abe\ade\acf 


0.052 


ace\ade\bcef 


0.051 








ace\ade\abcf 


0.050 




bc\be\ace\ade\bf 


med 


be\ace\ade\bcf 


med 


Graph. /PM 


ace\bce\ade\bcf 


0.091 


ace\bce\ade\bcf 


0.280 




ace\bce\ade\bf 


0.080 


ace 1 bee \ade\ bde \ bcf 


0.138 




ri \ n rl \ rt T 
Uit^C \ LLUiC\tJiL. J 




n \ n ri <i \ hr" t \ /^f^ t 
LC Ui tie j CC / 


n 1 04 








abc\ace\ade\bcf 


0.082 








abc\abe\ade\bcf 


0.081 








ace\ade\bcf 


0.070 




bc\be\ace\ade\bf 


med 


be\ace\ade\bcf 


med 


Graph. /Lapl 


ac\bc\be\ade\bf 


0.162 


ac\be\ade\bcf 


0.161 




ac\be\ade\bcf 


0.128 


ace\bce\ade\bcf 


0.114 




ac\bc\be\ade\af\bf 


0.068 


ac\be\ade\bcf\df 


0.109 




ac\bc\be\ade\bf\df 


0.068 


ace bee \ade\ bcf \ df 


0.077 




ac\bc\be\ade\ef 


0.068 


ac\ade\bcf\bef 


0.069 




ac\bc\be\ade\f 


0.057 


ac\ade \ bcf \bef\def 


0.064 




ac\be\ade\bcf\df 


0.054 








ac\bc\be\ade\bf 


med 


ac\be\ade\bcf 


med 


Hierar. 


ac\bc\ad\ae\be\ce\de\bf 


0.023 


ac\ad\ae\be\ce\de\bcf\ef 


0.012 




ac\bc\ad\ae\be\ce\de\bf 


med 


ac\ad\ae\be\ce\de\bcf\df\ef 


med 



most probable models identified by Edwards and Havranek [14], Madigan 
and Raftery [21] or Dellaportas and Forster [9] indicate the independence of 
/ from the other risk factors. Their findings are consistent with the models 

Table 5 

Posterior inclusion probabilities for edge bf 



a 



Search 


0.01 


0.1 


1 


2 


3 


32 


64 


128 


Dec. 


0.002 


0.022 


0.149 


0.219 


0.261 


0.49 


0.645 


0.918 


Graph./PM 


0.002 


0.033 


0.152 


0.222 


0.263 


0.476 


0.608 


0.898 


Graph. /Lapl 


0.027 


0.080 


0.205 


0.260 


0.296 


0.570 


0.716 


0.940 


Hierar. 


0.028 


0.084 


0.227 


0.297 


0.343 


0.708 


0.867 


0.995 
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we identify for smaller values of a. However, as we increase the grand total 
a in the prior fictive table we use, a direct link between h and / appears in 
our highest probable models. Table 5 shows the posterior inclusion proba- 
bility of the first-order interaction between h and / for various choices of a 
and structural model constraints. Table 5 seems to confirm Whittaker's find- 
ings, because the posterior probability of the edge 6/ increases from 0.002 
to almost 1. This edge does not appear in sparser models corresponding to 
smaller values of a, because there are stronger associations among a, 6, c, d 
and e than between / and another risk factor. 



6. Further developments. The family of conjugate priors introduced in 
this paper has a large area of applicability. Dobra and Massam [13] make 
use of these priors to analyze eight and 16-dimensional contingency tables. 
Due to the inherent sparsity of such datasets, penalizing for increased model 
complexity is key and can be done naturally in the framework we have de- 
scribed. The same priors are used in Dobra et al. [11] to develop variable 
selection approaches for regressions induced by log-linear models. Their ex- 
amples involve data from genomewide studies. 



APPENDIX 

A.l. Proof of Lemma 4.1. Let /ix) = {/^(^d) = D G P, G 2|)} 
denote the mean parameter of (2.21). Leucari (2004) has proved that the 
Jacobian |^^| can be expressed as the inverse of the right-hand side of (4.5) 
for the binary case. The proof can immediately be extended to the discrete 
data case. To complete the proof of Lemma 4.1, it remains to show that the 
Jacobian of [Id ^ pg is equal to 1. This is immediate, since the parameters 
in (4.4) are such that 

p'''{^D,^h^)= E E (-l)'^'M^D,iF), 

FQCi\DjF&*p 

p''{^dXd^)= E E (-i)"''M^i^,JF). 

FQSi\DjF&'F 

The Jacobian of fij) ^ Pg is therefore, 1 and the lemma is proved. 



A.2. Proof of Proposition 4.1. The distribution of Y in (2.21) is Markov 
with respect to G; therefore, 

(A.l) p(^)-ntip^'fc) 
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Since we are dealing with joint as well as marginal probabilities, we revert 
to the more precise notation {iF,i*Fc) rather than i{F). Then, 

FCE 

Ck k \ 

1=1 1=2 / 

1=1 ^FCE ^ 
-E(E(-l)"'^'^'log/'(^Fn5.,^>.n5,)) 

1=2 ^FCE ^ 
k k 

1=1 1=2 

U ECCi, EnCi = E and 9^'{iEnc,) = 0^'{iE). li E ^ d, then, by Lemma 
2.1, 9^'-{iEnCi) ~^ similarly, for 9^^{iEnSi)- therefore have 

k k 

(A.2) 0(iD)=E^^'(^^5)-E^^'(^D), DeV, 

1=1 1=2 

where 

0'''iiD)= E (-l)l''\''llogp^'(^F,WJ for Z?CC„ 
FCD 

0g'=logpg% 

and similar expressions for 9^^ {in) (see also Consonni and Leucari [4] for the 
derivation of these formulas in the case of binary data). For the remainder 
of this proof, it will be understood that, for E C Ci, we use the notation 

p'"^{iE,'i'*Ec) for p^'-{iE,i*Ecf^Ci)- Now, from (A.l), we also have 

k k 

iogp0 = E iogP0 - E ^^sp0 ■ 

1=1 1=2 

Therefore, (3.1) can be written as 

TTG{Ovip^)\s,a) 

oc 



1=1 ^D£V^i io&'u ^ECD ^ 
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k 

X 



1=2 ^DeV^i io&o ^ECD ^ 



(A. 3) +alogp0 

nzL2 explEijgDS; Ei^eJi; a^' (^s, logp^'{iE, igc) + a| logp|'} 

nt2(pl')"^' Hi^eP.-, U^,eX*^{p''i^E,^*E.)r"'^'^''*-^^ ' 

where a*-^' (i^;, z^c), a*^' i^c), and 00 are as defined in (4.6) and (4.7). 

The induced prior on is obtained by multiplying (A. 3) by the Jacobian 
(4.5) and it follows immediately that it is the hyper Dirichlet with hyper 
parameters as given in (4.6) and (4.7). 

The expression of (4.8) is obtained by noticing that, for any C/, 

"0 + E E a'^'iiE,i*Ec) = a = a0' + ^ ^ a'^'{iE,i*Ec). 

This completes the proof. 

A. 3. Proof of Lemma 4.2. For ease of notation, we will give the proof of 
the lemma in the case of binary data. Given definition (4.9), the proof for 
binary and discrete data are exact parallel of each other. Since, for binary 
data, for each C (z V and H ^ Eq there is only one cell in and X^, 
respectively, we will adopt the notation 

Fc,H = F{ic,jH). C£V,Hg£. 

Let us first prove that, if H is decomposable, (4.12) is true. We proceed by 
induction on the number k of cliques of H. Let C = {Ci, . . . , Ck} be a perfect 
ordering of the cliques of H. 

If H is complete, that is, k = 1, we consider two cases, the case where 
\H\ is even and the case where it is odd. For \H\ = 2p,p G N, there are rig = 

X]fc=i ('2^') nonempty subsets of H of even cardinality and no = Z]fc=o (2fc+i) 
subsets of odd cardinality. Therefore, 



2p 

E 

CCTyH k=l 



2p 
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and (4.12) is verified. We omit the proof for the case \H\ = 2p — 1, which is 
parallel to that of the previous case. Therefore, (4.12) is verified for k = 1. 

Let us now assume that H is decomposable but not complete, that is, 
A; > 1, and let us assume that (4.12) is true for any decomposable subset with 
k — 1 cliques. It is well known from the theory of decomposable graphs that, if 
we write Hk-i = Uj=i Cj, then H = Hk-i U {Ck \ Sk), where Sk = Hk-i n Ck 
is the /cth minimal separator in H. Therefore, we have 

(A.4) Fc,H= E Fc,H+(^ E Pc,H- E Pc,Hy 

The first term on the right-hand side of (A.4) is equal to 1 by our induc- 
tion assumption, while each one of the two other terms is also equal to 1, 
because both Ck and Sk are complete; therefore, (4.12) is also verified for 
decomposable H. 

Let us now prove that if H is not decomposable and connected, X^ccjjH Fc,h 
cannot be equal to 1. If is not connected and its connected components 
H^^\...,H^^\ for some Z > 2, are all decomposable, we clearly have 

E ^C,H = E( E Pc,Hu)^l- 
CCt^H j=l ^cCTyHO) ^ 

If H is not connected and its components are not all decomposable, this 
implies that there is a nondecomposable subset F\ of G, which can be sepa- 
rated from another subset F2 of G, but this contradicts our assumption that 
G is a prime component of G. So, this case does not occur. 

If H is not decomposable and connected, consider its set of cliques {Gi , . . . , 
Ck\. Since H is not decomposable, there is no perfect ordering of the 
cliques; therefore, for any given ordering, there exists a nonempty subset 
Q ^ {3, . . . , /c} such that, for j € Q, there is no i < j in the given ordering of 
the cliques of H with Sj = Cj n (Uz=i Q) ^ Ci- Therefore, 

= Cj n f U = > 2 < < j - 1 , 
\/=i / 1=1 

where the Sj^ can be chosen to be disjoints, with Sj^ C Cj n Cm for some 
mG{l,... J-1}. 

For J G Q = {2, . . . , fc} \ Q, there exists i < j in the given ordering of the 
cliques of H such that Sj C Gj. Therefore, 

(A.5) Pc,H= E Fc,H + Y_{ E Fc,H- E Pc,h) 

(A.6) + E f E ^c,// - E E i'c,h] . 
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The sums J2ccj,u ^C,H, U = Ci,Cj, Sj,j £ Q are all equal to 1, since each 
of Ci,Cj, Sj,j G Q are complete and connected; therefore, the right-hand 
side of (A. 5) is equal to 1. For the same reason, on line (A. 6), for U = 
Cj,Sjj,j e Q,Z = l,...,Sj,Ecci,c/^c,H = l- Since sj > 2, 

Therefore, the sum on line (A. 6) is less than or equal to — |Q|. It follows 
that 

E FC,H<0 
CC-pH 

and in particular it cannot be equal to 1. The lemma is now proved. 

A. 4. Proof of Lemma 4.3. The rows and columns of the Jacobian of the 
change of variables pT> i-^ ^x) is a dp x dj) determinant with rows and columns 
indexed by = {io, D £'D,i£) £ T^,} ordered in an arbitrary manner. For 
in G '?23) tlie Z£)-column is the vector of derivatives of with respect 

to 0{jc),jc £ Tj)- From (2.19), straightforward differentiation shows that 



{Ih)c=3C 



where 

1, if(iD)c=jc, 



%D)c(ic) 



0, otherwise. 



We note that the factor p{i{D)) is common to all components of the if) 
column and therefore the Jacobian is equal to 

J = det^ Yl p{i{D)), 

where A is the d-p x dp matrix with entries 

hD)ci3c) - E PiKH)), in e n, jc e T^. 

{^h)c=3C 

We also note that if C is maximal, with respect to inclusion, for the row 
r(jc) of A corresponding to jc ^T^t)-, the only entry for which S^^i^-^^^jc) 7^ 
is the diagonal entry; therefore, for C maximal, we can write 

(A.8) r(ic) = e(jc)-( E PiK^y, 

(lH)c=jc 
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where e{jc) is the dp-dimensional vector with components all equal to 
except for the jc component, which is equal to 1, and 1 is the vector dv- 
dimensional vector with all its components equal to 1. If C is not maximal, 
the entries corresponding to the columns i{D) with {10)0 = jc (which im- 
plies that C C D) also have 5(^i^-^^{jc) = 1. In order to eliminate, in the row 
f{3c) of A, the ^{iij)c{3c) = 1 for C strictly included in Z), we replace r(jc) 

by 

r{3c)=r{3c)+ E 

i)'F)c=3C 

fdc 

which yields, for any jc ^T-B, 

(A.9) f{jc) = e{jc)-( E (-l)""^^' E Pm))^- 

^FDC,FeV {Ih)c=3C 

Clearly, for C maximal, f{jc) =f{jc)- Moreover, the matrix A obtained by 
replacing f{jc) by r{jc) has the same determinant, as A and is equal to 

A = Ia^-Ul\ 

where U is the column vector 

u=( E (-i)'^^"" E pm)),jceTi). 

VdcfgB {lH)c=jc ^ 

It is well known that, for A of that form, det74 = 1 — that is, 

deti=i- E ( E (-1)"^^'^' E pim) 

jc<^r* Vdcfgb {iH)c=jc 



1- E pWh))( E E (-1 



\F\C\ 



= 1- E p(m) E (-1)"^'"' ' 

where the last equality follows from the general fact that X)cceF(~l)'^'^'^' — 
(— Therefore, (4.14) for the general hierarchical model is now proved. 

From (4.9) and Lemma 4.2, we see that in the particular case of graphical 
models, we have 

detA = p^+ E PiiiH))+ E P('(H)) 
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+ E pW^)) 1- E (-1)'^'^' 



FCt)H 



:p^- E \pm)HH)], 

H&Jq 



which proves (4.15). 
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